- Asia > China > Hong Kong (0.04)
- Oceania (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Research Report > New Finding (0.67)
- Research Report > Experimental Study (0.46)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.92)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Europe > Spain > Canary Islands (0.04)
- North America > United States (0.04)
- Europe > France > Hauts-de-France > Nord > Lille (0.04)
- Europe > Germany > Brandenburg > Potsdam (0.04)
- Europe > France > Occitanie > Hérault > Montpellier (0.04)
- North America > Canada (0.04)
- Europe > France (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > California > San Mateo County > San Mateo (0.04)
- Europe > Spain > Basque Country > Biscay Province > Bilbao (0.04)
Price of universality in vector quantization is at most 0.11 bit
Alina Harbuzova, Or Ordentlich, Yury Polyanskiy
Fast computation of the matrix product $W^\top X$ is a workhorse of modern LLMs. To make their deployment more efficient, a popular approach is to use a low-precision approximation $\widehat W$ in place of the true $W$ ("weight-only quantization"). Information theory shows that an optimal algorithm for reducing the precision of $W$ depends on the (second-order) statistics of $X$ and requires careful alignment of the vector quantization codebook with the PCA directions of $X$ (a process known as "waterfilling allocation"). Dependence of the codebook on the statistics of $X$, however, is highly impractical. This paper proves that there exists a universal codebook that is simultaneously near-optimal for all possible statistics of $X$, in the sense of being at least as good as an $X$-adapted waterfilling codebook whose rate is reduced by 0.11 bit per dimension. Such a universal codebook would be an ideal candidate for a low-precision storage format, a topic of active current research, but alas the existence proof is non-constructive. Equivalently, our result shows the existence of a net in $\mathbb{R}^n$ that is a nearly optimal covering of the sphere simultaneously with respect to all Hilbert norms.
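For context, the "waterfilling allocation" mentioned above is, in its classical rate-distortion form, the following rule (a standard textbook fact restated here for orientation; the notation is ours, not the paper's): if the covariance of $X$ has eigenvalues $\sigma_1^2 \ge \dots \ge \sigma_n^2$ (its PCA directions), a water level $\theta$ is chosen so that the per-direction distortions $D_i = \min(\theta, \sigma_i^2)$ sum to the total budget $D$, and direction $i$ is allotted $R_i = \frac{1}{2}\log_2(\sigma_i^2/D_i) = \max\big(0, \frac{1}{2}\log_2(\sigma_i^2/\theta)\big)$ bits. High-variance directions thus receive more precision; the paper's claim is that a single fixed codebook loses at most 0.11 bit per dimension against every such $X$-adapted allocation.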
- North America > United States > Massachusetts (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.04)
Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averaging
Alexandru Meterez, Pranav Ajit Nair, Depen Morwani, Cengiz Pehlevan, Sham Kakade
Large language models are increasingly trained in continual or open-ended settings, where the total training horizon is not known in advance. Despite this, most existing pretraining recipes are not anytime: they rely on horizon-dependent learning-rate schedules and extensive tuning under a fixed compute budget. In this work, we provide a theoretical analysis demonstrating the existence of anytime learning-rate schedules for overparameterized linear regression, and we highlight the central role of weight averaging (also known as model merging) in achieving the minimax convergence rates of stochastic gradient descent. We show that these anytime schedules decay polynomially with time, with the decay rate determined by the source and capacity conditions of the problem. Empirically, we evaluate 150M- and 300M-parameter language models trained at 1-32x Chinchilla scale, comparing constant learning rates with weight averaging and $1/\sqrt{t}$ schedules with weight averaging against a well-tuned cosine schedule. Across the full training range, the anytime schedules achieve final loss comparable to cosine decay. Taken together, our results suggest that weight averaging combined with simple, horizon-free step sizes offers a practical and effective anytime alternative to cosine learning-rate schedules for large language model pretraining.
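To make the recipe concrete, here is a minimal NumPy sketch of horizon-free $1/\sqrt{t}$ step sizes combined with a uniform running average of the iterates (Polyak-style weight averaging) on a toy regression problem. This is our illustration, not the paper's code; the name anytime_sgd and all constants are ours.

import numpy as np

def anytime_sgd(grad, w0, eta0=0.1, steps=10_000):
    """SGD with eta_t = eta0/sqrt(t) and uniform iterate averaging.
    The averaged iterate w_bar is valid at *any* step: no training
    horizon needs to be fixed in advance."""
    w = w0.copy()
    w_bar = w0.copy()
    for t in range(1, steps + 1):
        w -= eta0 / np.sqrt(t) * grad(w)   # horizon-free step size
        w_bar += (w - w_bar) / t           # running average of w_1..w_t
    return w_bar

# Toy usage: noisy linear regression.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 5))
y = A @ np.ones(5)
grad = lambda w: A.T @ (A @ w - y) / len(y) + 0.01 * rng.normal(size=5)
print(anytime_sgd(grad, np.zeros(5)))   # should be close to all-ones

The point the sketch illustrates: because neither the step size nor the averaging weights depend on a final step count, training can be stopped, resumed, or extended at will, which is exactly the anytime property the abstract contrasts with horizon-dependent cosine decay.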
- Asia > Middle East > Jordan (0.04)
- Europe > Poland > Greater Poland Province > Poznań (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.86)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)